Investigating Distance Metrics in Semi-supervised Fuzzy c-Means for Breast Cancer Classification
Abstract
In previous work, semi-supervised Fuzzy c-means (ssFCM) was used to classify the Nottingham Tenovus Breast Cancer (NTBC) dataset automatically, because no automatic method for this currently exists. However, the results were poor compared with semi-manual classification. The NTBC data is known to be highly non-normal, and this was suspected to have contributed to the poor results. This motivated an investigation of alternative distance metrics and their effect on classification results. Mahalanobis, Euclidean and kernel-based distance metrics were applied to 100 sets of randomly selected labelled data. ssFCM with Euclidean distance successfully and automatically identified the six classes in close agreement with those of Soria et al. The key features that define the breast cancer classes also agree closely with those of Soria et al. The superiority of Euclidean distance over Mahalanobis distance is unexpected, since Euclidean distance can only generate spherical clusters, whereas Mahalanobis distance can generate hyperellipsoidal clusters, of which spherical clusters are a special case. We had expected Mahalanobis distance to generate the hyperellipsoidal clusters that would best fit the NTBC data.
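The contrast between spherical and hyperellipsoidal clusters can be illustrated with a minimal sketch. It is not the authors' implementation: the two cluster centres, the covariance matrices and the test point are made-up values, the function names are hypothetical, and the membership update shown is the standard unsupervised FCM rule rather than the semi-supervised variant used in the paper.

```python
import numpy as np

def euclidean_sq(x, v):
    """Squared Euclidean distance between point x and centre v."""
    d = x - v
    return float(d @ d)

def mahalanobis_sq(x, v, cov):
    """Squared Mahalanobis distance of x from centre v under covariance cov."""
    d = x - v
    # Solve cov * y = d instead of inverting cov explicitly.
    return float(d @ np.linalg.solve(cov, d))

def fcm_memberships(sq_dists, m=2.0):
    """Standard FCM membership update from squared distances to each centre.

    Assumes every distance is non-zero; m is the fuzzifier (m > 1).
    """
    sq = np.asarray(sq_dists, dtype=float)
    ratios = (sq[:, None] / sq[None, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=1)

# Illustrative (assumed) data: cluster 0 is elongated along the first axis,
# cluster 1 is spherical. The point x lies midway between the two centres.
centres = [np.array([0.0, 0.0]), np.array([6.0, 0.0])]
covs = [np.array([[9.0, 0.0], [0.0, 1.0]]), np.eye(2)]
x = np.array([3.0, 0.0])

u_euclid = fcm_memberships([euclidean_sq(x, v) for v in centres])
u_mahal = fcm_memberships([mahalanobis_sq(x, v, c) for v, c in zip(centres, covs)])

print("Euclidean memberships:  ", u_euclid)
print("Mahalanobis memberships:", u_mahal)
```

Under Euclidean distance the point is equidistant from both centres and receives equal membership in each cluster. Under Mahalanobis distance the elongated cluster reaches further along its long axis, so the same point is assigned mostly to cluster 0.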
Similar resources
An investigation on scaling parameter and distance metrics in semi-supervised Fuzzy c-means
The scaling parameter α helps maintain a balance between supervised and unsupervised learning in semi-supervised Fuzzy c-Means (ssFCM). In this study, we investigated the effects of different α values (0.1, 0.5, 1 and 10) in Pedrycz and Waletzky’s ssFCM with various amounts of labelled data (10%, 20%, 30%, 40%, 50% and 60%) and three distance metrics (Euclidean, Mahalanobis and kernel-based) on th...
A methodology for automatic classification of breast cancer immunohistochemical data using semi-supervised Fuzzy c-means
Previously, a semi-manual method was used to identify six novel and clinically useful classes in the Nottingham Tenovus Breast Cancer dataset. 663 out of 1076 patients were classified. The objectives of our work are threefold. Firstly, our primary objective is to use one single automatic method (post-initialisation) to reproduce the six classes for the 663 patients and to classify the remainin...
A Preliminary Study on Automatic Breast Cancer Data Classification using Semi-supervised Fuzzy c-Means
Soria et al. have successfully identified six clinically useful and novel subgroups in the Nottingham Tenovus Breast Cancer dataset. However, the methodology used is semi-manual, and so far no single clustering method can classify the dataset automatically. In this work, two variations of semi-supervised Fuzzy c-means (ssFCM) algorithms are explored to classify the Nottingham Tenovus Breast Cancer datas...
An exploration of improvements to semi-supervised fuzzy c-means clustering for real-world biomedical data
This thesis explores various detailed improvements to semi-supervised learning (using labelled data to guide clustering or classification of unlabelled data) with fuzzy c-means clustering (a ‘soft’ clustering technique which allows data patterns to be assigned to multiple clusters using membership values), with the primary aim of creating a semi-supervised fuzzy clustering algorithm that shows ...
Semi-Supervised Techniques in Breast Cancer Classification: A Comparison between Transductive SVM and Semi-Supervised FCM
The Nottingham Tenovus Breast Cancer data has been successfully classified into six novel and clinically useful subgroups, but the existing technique is semi-manual. In this work, we use Transductive Support Vector Machine (TSVM) and semi-supervised Fuzzy c-means (ssFCM) as automatic techniques to classify the dataset and evaluate our results using 10-fold cross-validation. A ...